This is an analysis of US presidential elections data for 2016 at the county level. Since only a small percentage of votes went to independent candidates, we will only compare Democrat and Republican voteshare.
The data for this analysis is taken from https://github.com/tonmcg/County_Level_Election_Results_12-16.
Library imports:
library(dplyr)
library(ggplot2)
library(knitr)
library(readr)
Read in the data:
df <- read_csv("2016_US_County_Level_Presidential_Results.csv")
kable(head(df))
X1 | votes_dem | votes_gop | total_votes | per_dem | per_gop | diff | per_point_diff | state_abbr | county_name | combined_fips |
---|---|---|---|---|---|---|---|---|---|---|
0 | 93003 | 130413 | 246588 | 0.3771595 | 0.52887 | 37410 | 15.17% | AK | Alaska | 2013 |
1 | 93003 | 130413 | 246588 | 0.3771595 | 0.52887 | 37410 | 15.17% | AK | Alaska | 2016 |
2 | 93003 | 130413 | 246588 | 0.3771595 | 0.52887 | 37410 | 15.17% | AK | Alaska | 2020 |
3 | 93003 | 130413 | 246588 | 0.3771595 | 0.52887 | 37410 | 15.17% | AK | Alaska | 2050 |
4 | 93003 | 130413 | 246588 | 0.3771595 | 0.52887 | 37410 | 15.17% | AK | Alaska | 2060 |
5 | 93003 | 130413 | 246588 | 0.3771595 | 0.52887 | 37410 | 15.17% | AK | Alaska | 2068 |
There are 3,141 rows in total, matching the number of counties in the US (Source: http://www.snopes.com/trump-won-3084-of-3141-counties-clinton-won-57/ and http://www.wnd.com/2016/12/trumps-landslide-2623-to-489-among-u-s-counties/).
The dataset contains the following columns:
names(df)
## [1] "X1" "votes_dem" "votes_gop" "total_votes"
## [5] "per_dem" "per_gop" "diff" "per_point_diff"
## [9] "state_abbr" "county_name" "combined_fips"
per_dem
and per_gop
refer to the percentage of votes going to Democrats and Republicans respectively.diff
represents the absolute difference between Republican votes - Democrat votes.per_point_diff
represents this difference as a percentage of total votes.combined_fips
is a 5-digit code identifying the county. (From Wikipedia: The FIPS county code is a five-digit Federal Information Processing Standards (FIPS) code (FIPS 6-4) which uniquely identifies counties and county equivalents in the United States, certain U.S. possessions, and certain freely associated states.)combined_fips
is the column that we are going to use to combine our elections data with our mapping data.
Since we are interested in whether a given county had more Republican or Democrat votes, we have to recompute the diff
and per_point_diff
columns. diff
and per_point_diff
will be positive if there are more Republican votes than Democrat votes (and vice versa).
df <- df %>% mutate(diff = votes_gop - votes_dem,
per_point_diff = diff / total_votes * 100)
Compute percentage of popular vote won by each party:
paste0("Republican % of popular vote: ",
round(sum(df$votes_gop) / sum(df$total_votes) * 100, digits = 1),
"%")
## [1] "Republican % of popular vote: 47.3%"
paste0("Democrat % of popular vote: ",
round(sum(df$votes_dem) / sum(df$total_votes) * 100, digits = 1),
"%")
## [1] "Democrat % of popular vote: 47.5%"
Although Clinton lost the presidential election, she actually won the popular vote!
Compute number of counties won by each party:
df %>% transmute(gop_won = votes_gop > votes_dem) %>%
summarize(gop_won = sum(gop_won))
## # A tibble: 1 x 1
## gop_won
## <int>
## 1 2654
Painting a completely different picture, Trump won 2654 out of 3141 counties (or 84.5% of all counties). Clinton only won 487 counties. This suggests that Clinton won in counties with large populations, or that the margin of victory was slimmer in the counties that Trump won compared with the counties that Clinton won.
We have Clinton winning the popular vote on one hand, but Trump winning many more counties. How can we reconcile these two facts?
One theory is that Clinton won her counties by a huge margin percentage-wise, while Trump won his counties by a slim margin percentage-wise. To test this theory, we could plot a histogram of the per_point_diff
:
ggplot() +
geom_histogram(data = df, mapping = aes(x = per_point_diff)) +
labs(title = "Histogram of % vote margin",
x = "% Republicans won by", y = "Frequency")
The chart does not support the theory that Trump had narrower margins of victory in the counties that he won: he won a sizeable number of counties with > 50% vote difference.
Let’s try plotting a histogram of diff
to look at absolute differences instead:
ggplot() +
geom_histogram(data = df, mapping = aes(x = diff)) +
labs(title = "Histogram of absolute vote margin",
x = "No. of votes Republicans won by", y = "Frequency")
This chart is very different! In the counties that Clinton won, she won it by extremely large margins in terms of absolute votes. Thus, even though she won very few counties compared to Trump, these large margins meant that she could actually win the popular vote.
The code below shows that the top 45 counties with largest absolute vote difference were all won by Clinton (number 46 was Montgomery, TX, which went to Trump).
df %>% select(State = state_abbr, County = county_name, diff) %>%
mutate(abs_diff = abs(diff)) %>%
arrange(desc(abs_diff)) %>%
select(State, County, `Vote difference` = diff) %>%
head(n = 50) %>%
kable()
State | County | Vote difference |
---|---|---|
CA | Los Angeles County | -1273485 |
IL | Cook County | -1088369 |
NY | Kings County | -461433 |
WA | King County | -459368 |
NY | New York County | -456546 |
PA | Philadelphia County | -455124 |
CA | Alameda County | -395162 |
CA | Santa Clara County | -346020 |
NY | Queens County | -334839 |
MA | Middlesex County | -292756 |
FL | Miami-Dade County | -289340 |
MI | Wayne County | -288934 |
FL | Broward County | -288435 |
MD | Prince George’s County | -284337 |
NY | Bronx County | -283979 |
CA | San Francisco County | -277950 |
DC | District of Columbia | -248670 |
MN | Hennepin County | -237515 |
MD | Montgomery County | -226776 |
OR | Multnomah County | -208699 |
OH | Cuyahoga County | -204080 |
TX | Dallas County | -196980 |
VA | Fairfax County | -196648 |
GA | DeKalb County | -191600 |
MA | Suffolk County | -191170 |
CA | Contra Costa County | -180839 |
CA | San Diego County | -180436 |
TX | Travis County | -179725 |
GA | Fulton County | -171503 |
NJ | Essex County | -168972 |
CA | San Mateo County | -165849 |
WI | Milwaukee County | -162895 |
TX | Harris County | -161511 |
MD | Baltimore city | -155836 |
WI | Dane County | -146236 |
OH | Franklin County | -143633 |
NC | Mecklenburg County | -137955 |
FL | Orange County | -134488 |
CO | Denver County | -130974 |
NY | Westchester County | -124027 |
CA | Sacramento County | -110744 |
LA | Orleans Parish | -109566 |
MN | Ramsey County | -106151 |
PA | Allegheny County | -105529 |
NC | Wake County | -104746 |
TX | Montgomery County | 104444 |
NJ | Hudson County | -104365 |
FL | Palm Beach County | -100649 |
MA | Norfolk County | -99958 |
TN | Shelby County | -91692 |
We didn’t go through this section in class as it uses ggplot
in a slightly more involved way. Take a look at the code and see if you can figure out what’s going on.
In the chart below, each vertical bar represents one county.
temp <- df %>% arrange(desc(per_point_diff))
ggplot(data = temp) +
geom_col(mapping = aes(x = as.integer(row.names(temp)),
y = per_point_diff,
fill = per_point_diff)) +
scale_fill_gradient2(low = "blue", high = "red") +
labs(title = "% Difference in votes received by county",
x = "", y = "% Difference") +
theme(legend.position = "right")
The chart above does not support the theory that Trump had narrower margins of victory in the counties that he won, as compared to Clinton.
Let’s make the same chart, but with absolute difference in votes received (instead of percentage difference):
temp <- df %>%
select(diff) %>%
arrange(desc(diff))
ggplot(data = temp) +
geom_col(mapping = aes(x = as.integer(row.names(temp)),
y = diff,
fill = diff)) +
scale_fill_gradient2(low = "blue", high = "red") +
labs(title = "Absolute difference in votes received by county",
x = "", y = "Absolute difference")
When analyzing elections, we have to examine the data from many different perspectives in order to get the full story.